1 Introduction

Pericytes are a cell type that enwraps small blood vessels throughout all vascularized tissues of the body. This cell type is thought to have the potential to profoundly influence clinical outcomes in chronic disease by modulating blood vessel behavior. However, pericytes are difficult to study because they lack exclusive cellular markers, making identification and isolation difficult. Currently, pericytes are identified by a panel of markers, their cellular shape, and their perivascular position relative to capillaries (Fig 1) (1-3). Recently, a study by the Yuan Lab [1] at Harvard Medical School sought to identify novel pericyte-specific markers using single cell RNA-seq from the Tabula Muris dataset. The Tabula Muris is a compendium of single cell transcriptome data from the model organism Mus musculus, containing nearly 100,000 cells from 20 organs and tissues [2].

Figure 1: Diagram of various cell types found in vascularized tissues. Smooth muscle cells, pericytes, fibroblasts, and mesenchymal stem cells all share similiar marker expression profiles and are often identified by their position relative to the vasculature and cell shape.
Figure 1: Diagram of various cell types found in vascularized tissues. Smooth muscle cells, pericytes, fibroblasts, and mesenchymal stem cells all share similiar marker expression profiles and are often identified by their position relative to the vasculature and cell shape.


The authors used UMAP to visualize clusters of single cell gene expression based on a euclidean distance kNN graph. Pericyte associated clusters were identified with high co-expression of two pericyte markers, CSPG4 and PDGFRB. A subclass of cells in this group with exceptionally high co-expression were identified as “stringent” pericytes. Differential gene expression analysis of stringent pericytes against all other cell types yielded a panel of novel candidate pericyte markers. While this study identified interesting new potential pericyte markers for scientific research, there are potential queries with the methodology used in the processing pipeline:

  1. Data Integration: the authors combined Tablua Muris datasets from various tissues into a single dataset using the IntegrateData() function in Suerat. Data integration is useful for eliminating batch effects and other uninteresting sources of variation when combining data across replicates of the same tissue (horizontal integration), or combining measurements from different technologies of the same cells (vertical integration), this study combines data from different organs. Such an integration process is not required for identifying cell markers across tissues- markers can be identified for each tissue individually without any integration. There is a possibility that using integration could drown out signal in tissues where the overall expression of the cell markers of interest is less than average.
  2. Clustering Effectiveness: the authors do not determine if their clustering analysis was effective in separating distinct cell populations within the tissue. Clusters could be evaluated based on their composition of cell types using manual expert annotations provided from Tabula Muris.
  3. Identity of Pericyte Cluster: the cluster that had high expression of CSPG4 and PDGFRB markers were designated as pericytes for this study, but these markers are also expressed in other cell types (1-3). It remains unknown what cell type the specific pericyte designated clusters were assigned according to Tabula Muris.

These queries motivate the following research questions:
1. Will the same overall trends in clustering still appear if the datasets are not integrated across tissue types?
2. Does the clustering approach used in the study effectively delineate between the manually annotated cell types from Tabula Muris?
3. Do the pericyte designated clusters align with the manual annotations that already exist in Tabula Muris?


Primary Paper and Dataset References
[1] S.-H. Baek, E. Maiorino, H. Kim, K. Glass, B. A. Raby, K. Yuan, Single Cell Transcriptomic Analysis Reveals Organ Specific Pericyte Markers and Identities. Frontiers in Cardiovascular Medicine. 9 (2022) (available at https://www.frontiersin.org/articles/10.3389/fcvm.2022.876591).
[2] The Tabula Muris Consortium., Overall coordination., Logistical coordination. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018). https://doi.org/10.1038/s41586-018-0590-4.

Primary Packages: Seurat, AnnotationHub, ggplot, tidyverse, pheatmap.
Supporting Packages: cowplot, RColorBrewer, BiocManager, stringr

>> Code for loading packages and initializing parallel processing.

# Note: this code assumes that the working directory is set to the source file 
# location

# Install required Base Packages
base_packages <- c("Seurat", "ggplot2", "tidyverse", "cowplot", "RColorBrewer",
                   "BiocManager", "stringr", "pheatmap", "future", "rmdformats",
                   "memuse")
install.packages(setdiff(base_packages, rownames(installed.packages())))  
# Install required Bioconductor Packages
biocm_packages <- c("AnnotationHub", "ensembldb")
bioc_installs <- setdiff(biocm_packages, rownames(installed.packages()))
if (length(bioc_installs)) {BiocManager::install(bioc_installs) }

# Load base packages
lapply(base_packages, library, character.only = TRUE)
# Load Bioconductor packages packages
lapply(biocm_packages, library, character.only = TRUE)

# Setup multicore processing
plan(multisession, workers = max(c(parallelly::availableCores()-2,1)))
# Set max ram to 90% of available
options(future.globals.maxSize = as.numeric(memuse::Sys.meminfo()[[2]]) *.9)

2 Methods

For the purposes of clarity and reproducibility, below are summaries of the processing pipeline used in this project, an explanation of the gene annotation database, supporting files necessary to carry out the analysis, and the source code.

2.1 Pipeline


Figure 2.1: Porcessing of pipeline for this study. Steps connected by white arrows are considered standard workflow for seurant, while grey arrows are custom analysis for this projects.
Figure 2.1: Porcessing of pipeline for this study. Steps connected by white arrows are considered standard workflow for seurant, while grey arrows are custom analysis for this projects.


Below are function parameters that were used in the reference publication for processing. 1. NormalizeData(normalization.method = “LogNormalize”, scale.factor = 10000)
2. FindVariableFeatures(selection.method = “vst”, nfeatures = 2000)
3. ScaleData(model.use = “linear”,)
4. RunPCA()
5. Dimensionality: Bladder: 30, Kidney: 45, Lung and Heart: 45 5. FindNeighbors(dims = 1:dimensionality)
6. FindClusters(resolution = 0.5, algorithm=1)
7. RunUMAP()


2.2 Gene Annotation


Figure 2.2: Conversion of ensemble IDs to gene names was carried out using the latest annotation files for Mus musculus in the AnnotateHub Bioconductor package.
Figure 2.2: Conversion of ensemble IDs to gene names was carried out using the latest annotation files for Mus musculus in the AnnotateHub Bioconductor package.




Seurat single cell datasets typically record gene expression by their ensemble ID which are unique static codes to identify features in the genome. However, they are not human readable, so we must convert ensemble IDs to the gene name with an annotated gene database. Processing pipelines typically use bioMart or AnnotationHub as a reference database. AnnotationHub was used because it maintains a 1:1 mapping from ensemble ID’s to gene names, while bioMart does not because it can have a one to many relationship between ensemble IDs and entrez IDs.

>> Code for annotating genes with gene name.

# Check for gene annotation file
if (!file.exists(paste0(getwd(),"/data/annotations_Mm.csv"))) {
# Connect to Annotation Hub as the source for gene annotations
ah <- AnnotationHub()
# Access the Ensemble database for organism
ahDb <- query(ah, pattern = c("Mus musculus", "EnsDb"),    ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>% mcols() %>% rownames() %>% tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]

# Extract gene-level information from database
annotations <- genes(edb, return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
  dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)

# Export table to csv
data.table::fwrite(annotations, file = paste0(getwd(),"/data/annotations_Mm.csv"))
}
# Load ensembl <-> gene annotation conversion table
gene_annotations <- data.table::fread(file = paste0(getwd(),"/data/annotations_Mm.csv"),
                                      header = TRUE)

2.3 Datafiles


Figure 2.3: Tabula Muris Atlas.

Seurat datafiles provided by the Tabula Muris were used in this analysis for bladder, heart, kindey, and lung tissues. Data sets were renamed to [tissue].rds and saved in the /data/ subfolder of the project repository.

Direct Link to Seurat datafiles found on Figshare.

  1. bladder.rds
    348 MB, MD5 checksum: 12349521850bed75aaba9a4abb5daf48

  2. heart.rds
    81 MB, MD5 checksum: f4740230d092fe5c072a3eacb5f533e6

  3. kidney.rds
    378 MB, MD5 checksum: afae3d5895401126c51090f5d6f9299a

  4. lung.rds
    752 MB,MD5 checksum: f6a12d88b19e7ecd94256c0f0f0ed284


2.4 Code

Note: Tabula Muris already performed based QC with the data, such as exlcuding cells with a high fraction of mitochondrial DNA, too few or to many features in a cell, and features in too few cells.

>> Code for processing Seurat pipeline.

# Get dataset filenames (named by tissue type, see Section 2.3.1
data_filenames <- str_replace(list.files(paste0(getwd(),'/data/'),pattern='rds',
                                         full.names = F), '.rds','')[c(4,2,3,1)]


# Specify dimensions for linear dimensional reduction (used by authors in paper)
dim_thresh <- c(30,35,45,35)
# Specify clusters of interest for each tissue
clusters_oi <- list(13,c(11,10),19,24)

# Load all Seurat objects into list
seurat_objs <- list()

# Process and cluster each tissue data set individually (do not integrate)
if (!file.exists(paste0(getwd(),"/results/seurat_objs.rdata"))) {
  for(x in seq_along(data_filenames)){
    # Load seurat object
    seurat_obj  <- readRDS(paste0(getwd(),'/data/', data_filenames[x], '.rds'))

    # Create cleaned/pretty cell type labels
    clean_labels <- function(x) 
      str_replace_all(str_replace_all(x, paste0(data_filenames[x]," "),""), 
                      c("cell "="", "^nan$" = "unknown"))
    seurat_obj@meta.data$cell_label <- 
      factor(str_to_title(clean_labels(seurat_obj@meta.data$free_annotation)))
    
    # Normalize expression by fraction of total, * 10,000, and log-transform
    seurat_objs[[x]] <- NormalizeData(object = seurat_obj)
    
    # Features selection: id features with highest single cell variation (used for PCA)
    seurat_objs[[x]] <- FindVariableFeatures(object = seurat_objs[[x]])
    
    # Eliminate uninteresting sources of variation (batch effects, technical noise, cell cycle)
    # By regressing out these signals with linear models and outputting the residuals
    seurat_objs[[x]] <- ScaleData(object = seurat_objs[[x]])
    
    # Perform PCA on residuals to compress dataset and reduce noise from inconsequential genes
    # (Using highly variable features identified earlier)
    seurat_objs[[x]] <- RunPCA(object = seurat_objs[[x]])
    
    # Automated alternative for determining number of PCs to include
    # JackStrawPlot(object = seurat_objs[[x]], PCs = 1:50)
    
    # Construct KNN graph refined with Jaccard similarity between local neighborhoods
    seurat_objs[[x]] <- FindNeighbors(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
    # Cluster cells with Louvain algorithm (default)
    seurat_objs[[x]] <- FindClusters(object = seurat_objs[[x]], resolution = 0.5)
    # Run non-linear dimension reduction to reveal underlying manifold of the data
    seurat_objs[[x]] <- RunUMAP(object = seurat_objs[[x]], dims = 1:dim_thresh[x])
 

  }
  # Export seurat objects to avoid recomputation
  names(seurat_objs) <- data_filenames
  save(seurat_objs, file = paste0(getwd(),"/results/seurat_objs.rdata"))
  
} else {
  # If seurat datafiles already computed, load from disk
  load(paste0(getwd(),"/results/seurat_objs.rdata"))
  
}

>> Code for data visualizations.

# Visualizations of dimensions, clusters, and gene expression
# Objects below are all ggplot objects
gg_elbows <- list()     # elbow plots
gg_umaps <- list()      # unsupervised umap plots
gg_dots <- list()       #  avg expression dot plots
gg_umap_dot <- list()   # merged umap and dot plots (above vars)
gg_annot_umap = list()  # annotated umap plot with tabula muris cell types
gg_annot_table= list()  # composiiton of umap clusters by tabula muris cell types
# Visualizations
for(x in seq_along(data_filenames)){
  
  # Convert cell type labels 
  remove_list = c("Epcam","Pecam", data_filenames[x], "cell")
  # 
  clean_labels = function(k) str_replace_all(str_replace_all(k, paste(remove_list,collapse="|"),""),
                                             c("\\s+"=" ", "^nan$" = "unknown"))
  seurat_objs[[x]]@meta.data$cell_label <-
    factor(str_to_title(str_trim(clean_labels(seurat_objs[[x]]@meta.data$free_annotation))))
  
  # Determine number of PCs to use
  gg_elbows[[x]] <- ElbowPlot(seurat_objs[[x]], ndims = 50, reduction = "pca") + 
    ggtitle(tools::toTitleCase(data_filenames[x]), ) +
    geom_vline(xintercept = dim_thresh[x], color = "red") +
    theme(plot.title = element_text(hjust = 0.5))
  save_plot(paste0(getwd(),'/results/elbow_', data_filenames[x], '.png'),
            gg_elbows[[x]], base_width = 3, base_height = 3)
  
    
  # Visualize clusters
  gg_temp <- DimPlot(seurat_objs[[x]], reduction = 'umap', label = FALSE, repel = TRUE) +
    ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]), 
                    format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE)))  +
    theme_void() + theme(legend.position="none",plot.title = element_text(size=12,hjust = 0.5))
  gg_umaps[[x]] <- LabelClusters(gg_temp, id = "ident",  fontface = "bold", size = 6)
  
  
  # Create a dot plot of expression of PC markers
  gg_dots[[x]] <- DotPlot(seurat_objs[[x]], features = c("ENSMUSG00000032911","ENSMUSG00000024620")) + 
    scale_x_discrete(labels=c('CSPG4', 'PDFRB'),position = "top")  + xlab("") + ylab("Cluster") +
    scale_y_discrete(position = "right") + #scale_color_gradient(title) +
    scale_size_area(breaks=c(25,50,75), limits = c(0, 75)) +
    scale_color_gradient(breaks = rev(c(0,1,2)), limits = c(0, 2.5), low = "#D1CFD4", high = "#1E0BFE")+
    guides(size=guide_legend(title="% Exp."), colour = guide_colourbar(title="Mean Exp.")) +
    theme(text=element_text(size=12),plot.margin = unit(c(0, 0, 0, 0), "cm"),
          axis.text.x = element_text(size =12,angle = 45, hjust =0), axis.text.y = element_text(size =12),
          axis.ticks=element_blank(),panel.grid.major=element_blank(),
          legend.position = "none",
          panel.grid.minor = element_blank())
  
  # Create grid of cluster and expression table, then add legend at bottom
  gg_cluster_expr <- plot_grid(gg_umaps[[x]], gg_dots[[x]], ncol = 2, rel_widths = c(3, 1))
  # Add a second row with a legend
  legend <- get_legend(gg_dots[[x]]+theme(legend.position = "bottom",legend.justification="center"))
  gg_umap_dot[[x]] <- plot_grid(gg_cluster_expr, legend, nrow = 2, rel_heights = c(1, .1))
  save_plot(paste0(getwd(),'/results/umap_organ_', data_filenames[x], '.png'),
            gg_umap_dot[[x]], base_width = 8, base_height = 6)
  
  
  # Visualize Cluster with cell type annotated by atlas
  gg_annot_umap[[x]] <- DimPlot(seurat_objs[[x]], reduction = 'umap', group.by = "cell_label", 
                                label = FALSE, repel = TRUE) +
    ggtitle(sprintf("%s: %s Total Cells", tools::toTitleCase(data_filenames[x]), 
                    format(dim(seurat_objs[[x]])[2],big.mark=",", trim=TRUE)))  +
    theme(legend.position="bottom",text = element_text(size = 9),
          plot.title = element_text(size=12,hjust = 0.5), legend.key.size = unit(0, 'cm'), 
          legend.spacing = unit(0.1, 'cm'), legend.margin = unit(0.1, 'cm')) +
    guides(color=guide_legend(nrow=c(8,3,5,3)[x], byrow=TRUE, override.aes = list(size=3))) 
  gg_annot_umap[[x]]
  
  
  # Calculate percentage of cell type per cluster
  tab <- table(Assigned=seurat_objs[[x]]@meta.data$cell_label, Clusters=seurat_objs[[x]]$seurat_clusters)
  perc_tab <- sweep(tab, 2, colSums(tab ), FUN = '/')
  gg_annot_table[[x]] <- pheatmap(perc_tab, color = colorRampPalette(c('white','blue'))(10),
                                  cluster_rows = F, cluster_cols = F, fontsize = 9)
  
  # gg_annot[[x]] <- plot_grid(gg_annot_umap[[x]], gg_annot_table[[x]][[4]], nrow = 2, rel_widths = c(1, 1))
  # save_plot(paste0(getwd(),'/results/umap_annot_', data_filenames[x], '.png'),
  #           gg_annot[[x]], base_width = 2*6, base_height = 2*6)
  
} 

2.5 Dimensionality

A significant step in processing single cell gene expression data is the linear dimension reduction that compresses datapoints into hyperplanes formed by a linear combination of the features within the dataset. This data used the same dimensional thresholds that the authors used in the original study because the thresholds seemed reasonable (determined with tissue-specific elbow plots that visualized variation captured by each dimension).

2.5.1 Lung


Figure 2.5.1: Standard deviation captured by increasing dimensions of linear reduction of the lung gene expression data with PCA. Authors set the dimensionality for the bladder dataset to 30 dimensions.


2.5.2 Heart


Figure 2.5.2: Standard deviation captured by increasing dimensions of linear reduction of the heart gene expression data with PCA. Authors set the dimensionality for the heart dataset to 35 dimensions.


2.5.3 Kidney


Figure 2.5.3: Standard deviation captured by increasing dimensions of linear reduction of the kidney gene expression data with PCA. Authors set the dimensionality for the kidney dataset to 45 dimensions.


2.5.4 Bladder


Figure 2.5.4: Standard deviation captured by increasing dimensions of linear reduction of the bladder gene expression data with PCA. Authors set the dimensionality for the lung dataset to 35 dimensions.


3 Results

To investigate the effect of leaving data unintegrated on pericyte marker analysis, the data was reprocessed in Seurat following the standard workflow (Fig 2.1). The same settings were used as the original study, including the dimensionality chosen for each dataset (meaning the same number of PCA principal components, Fig 2.4.1-4). UMAP clustering was used to visualize the data with average expression of pericyte markers CPSG4 and PDGFRB visualized, along with a heatmap of the cell type composition of each cluster according to annotations from the Tabula Muris atlas.

3.1 Lung

Figure 3.2.1A: UMAP clustering of gene expression from single cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 156 cells comprising cluster 13 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 284 cells from cluster 16 in the study).

Figure 3.2.1B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


3.2 Heart

Figure 3.2.2A: UMAP clustering of gene expression from single cells isolated from heart issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 394 cells comprising clusters 10 and 11 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 392 cells from clusters 10 and 12 in the study).

Figure 3.2.2B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


3.3 Kidney

Figure 3.2.3A: UMAP clustering of gene expression from single cells isolated from kidney issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 229 cells comprising cluster 19 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 358 cells from cluster 17 in the study).

Figure 3.2.3B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


3.4 Bladder

Figure 3.2.4A: UMAP clustering of gene expression from single cells isolated from lung issue, paired with average expression of pericyte markers CSPG4 and PDGFRB for each cluster. Similar to the results from the paper, the 62 cells comprising cluster 24 expressed both CSPG4 and PDGFRB at high levels and in a large fraction of cells (compared to 345 cells from clusters 10 and 11 in the study).

Figure 3.2.4B: Heatmap of cluster cell type composition using annotations from the Tabula Muris atlas.


4 Discussion

Overall, the clustering approach without data integration yielded similar pericyte-like clusters as those found in the original study. However, for all tissues, there was a higher fraction of cells expressing both pericyte markers when the datasets were unintegrated, suggesting that avoiding integration resulted in more homogeneous and purified pericyte-like clusters. Compared to the original study, about 10% of cells in lung tissue expressed both markers (compared to 35%), 25% of cells in heart tissue (compared to 50%), 30% of cells in kidney tissue (compared to 50%), and 10% of cells in bladder (compared to 25%).

When compared to the cell type annotation provided by the Tabula Muris data, the unsupervised clustering performed reasonably well in separating them into distinct groups (Figure 3.2.[1-4]B). For lung tissue, clusters appeared to be split between cell types that have a high degree of similarity, such as T cell types or myeloid dendritic cells (Fig 3.2.1B). Similar trends were observed with heart tissue, although there were a multitude of cell types with unknown cellular origins for this tissue type (Fig 3.2.2B). For kidney tissue, overlap was mostly observed for immune cells and endothelial cells (Fig 3.2.3B). For bladder tissue, there was overlap with epithelial cell types, but otherwise most clusters did not map to multiple cell types (Fig 3.2.4B).

The cluster of cells that had high expression of CSPG4 and PDGFRB and deemed to be a source of pericytes did not always match their annotated cell type from the atlas. For lungs, the pericyte cluster were labeled as pericytes, but were labeled as smooth muscle cells in the heart, stroma mesangial cells in the kidney, and endothelial cells in the bladder. These discrepancies will require further follow up with the literature and more detailed marker expression of these groupings, and also highlights the difficulty in determining clean annotations for cell type.

Future Questions
1. Does that marker expression of the stringent pericyte cluster contain various pericyte markers shown in the literature?
2. Are there cells with marker expression profiles that match pericytes found in other clusters?
3. Will the same results be observed if the data is reprocessed with recent advances in transcriptomic analysis (4)?

5 References

  1. Riew TR, Jin X, Kim S, Kim HL, Lee MY. Temporal dynamics of cells expressing CSPG4 and platelet-derived growth factor receptor-β in the fibrotic scar formation after 3-nitropropionic acid-induced acute brain injury. Cell Tissue Res. 2021 Sep;385(3):539-555. doi: 10.1007/s00441-021-03438-3. Epub 2021 Apr 17. PMID: 33864501.

  2. Z. Jiang, T. Feng, Z. Lu, Y. Wei, J. Meng, C.-P. Lin, B. Zhou, C. Liu, H. Zhang, PDGFRb+ mesenchymal cells, but not CSPG4+ mural cells, contribute to cardiac fat. Cell Reports. 34, 108697 (2021).

  3. H. van Splunder, P. Villacampa, A. Martínez-Romero, M. Graupera, Pericytes in the disease spotlight. Trends in Cell Biology. 0 (2023), doi:10.1016/j.tcb.2023.06.001.

  4. Allan-Hermann Pool et al, Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references, Nature Methods (2023). DOI: 10.1038/s41592-023-02003-w



©2023 Bruce Corliss